Learning hierarchical policies from human ratings
Abstract
Robots are on the verge of becoming ubiquitous. In the form of affordable humanoid toy robots, autonomous cars, vacuum robots, or quadrocopters, robots are becoming part of our everyday life. As of today, most of these robots still follow largely hard-coded behavior routines. Constraining a robot’s behavior to pre-programmed routines, however, limits its potential in several important ways. For example, programming even simple behavior patterns is a challenging task, and programming behavior with human-like performance by hand seems impossible. The goal of this thesis, thus, is to develop methods which allow robots to learn solutions to tasks through trial and error instead of relying on manual programming. These learned solutions should fulfill a range of desired properties. Foremost, the solutions should be learned on real-world robots and not be constrained to simplified simulation environments. Furthermore, we would like the robot to learn versatile solutions which are able to cope with different variations of a task. Finally, we would like to be able to also solve more complicated tasks which require sequencing of multiple skills.
In the first part of this thesis, we propose an algorithm to learn such versatile solutions. The proposed algorithm aims to find a hierarchical policy, consisting of a gating policy and a set of sub-policies. The gating policy selects a sub-policy, and the selected sub-policy decides which action to take. Each of the learned sub-policies may be able to solve the task or a part of the overall task. To learn multiple sub-policies from one sample set, we employ an expectation-maximization based learning algorithm, where the sub-policies are updated according to their responsibilities for individual samples. These responsibilities indicate how likely it is that a certain sub-policy generated a given state-action pair.
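As an illustration of the responsibilities described above, the following minimal sketch computes p(o | s, a) for a toy hierarchical policy and uses it in one weighted update step. Everything concrete here is an assumption made for the example and is not taken from the thesis: scalar states and actions, linear-Gaussian sub-policies with mean `gains[o] * s`, gating probabilities passed in as a precomputed array, and the function names themselves.

```python
import numpy as np


def gaussian_pdf(a, mean, std):
    """Density of a 1-D Gaussian, evaluated elementwise."""
    return np.exp(-0.5 * ((a - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def responsibilities(states, actions, gating_probs, gains, stds):
    """E-step: p(o | s, a) is proportional to pi(o | s) * pi(a | s, o).

    states, actions: arrays of shape (N,)
    gating_probs:    array of shape (N, K), each row sums to 1
    gains, stds:     arrays of shape (K,); sub-policy o has mean gains[o] * s
                     (a hypothetical linear-Gaussian form chosen for the sketch)
    """
    means = states[:, None] * gains[None, :]  # shape (N, K)
    likelihood = gaussian_pdf(actions[:, None], means, stds[None, :])
    joint = gating_probs * likelihood
    return joint / joint.sum(axis=1, keepdims=True)


def update_gains(states, actions, resp):
    """M-step: responsibility-weighted least squares for each sub-policy gain."""
    num = (resp * (states * actions)[:, None]).sum(axis=0)
    den = (resp * (states ** 2)[:, None]).sum(axis=0)
    return num / den
```

Each sample contributes to a sub-policy's update in proportion to its responsibility, so a sample that one sub-policy explains well has little influence on the others. The sketch omits the entropy constraint and the update of the gating policy.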
By constraining our learning algorithm to solutions which increase the entropy of the responsibilities, the robot will learn a set of sub-policies that encode different solutions for the same task. This kind of concurrency is highly desirable, as it allows the robot to learn back-up solutions which may still be valid even if the original solution fails due to changes in the robot or environment.
In the second part of this thesis, we tackle the challenge of learning skills directly from state-action trajectories. A common approach in robot learning is to assume access to some sort of parametrized skill to represent the robot’s behavior over multiple time steps. These skills are usually either movement primitives (MPs) or parametrized feedback controllers. While MPs in particular have been a cornerstone in advancing the state of the art in robot learning, it is not clear how to learn from the actions taken throughout the execution of a skill, and MPs usually do not encode feedback. In the discrete state-action reinforcement learning (RL) setting, macro-actions, or options, have been introduced to address the problem of learning temporally correlated actions, which can be viewed as a form of skill. To connect robot learning with the advances in the field of discrete state-action RL, we propose a probabilistic framework to infer options from state-action trajectory observations. The inference is based on a hidden Markov model (HMM), where the option indices are modeled as latent variables and where inference can be performed by adapting well-known expectation-maximization algorithms. Because this framework allows for the inference of parametric policies, it is also compatible with policy search (PS) methods, a class of RL algorithms which is at the core of many recent successes in robot learning.
Learning methods, such as the one we propose in the first chapter of this thesis, enable robots to learn from trial and error instead of having to program specific solutions.
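To make the HMM-based option inference from the second part more concrete, here is a minimal sketch of a scaled forward-backward pass that returns the posterior over the latent option index at each time step. It is a simplified stand-in rather than the framework proposed in the thesis: option switching is assumed to follow a fixed transition matrix, the per-step likelihoods pi(a_t | s_t, o) are assumed to be precomputed, and the function and argument names are hypothetical.

```python
import numpy as np


def option_posteriors(emission, transition, initial):
    """Scaled forward-backward pass over latent option indices.

    emission:   (T, K) array, emission[t, o] = pi(a_t | s_t, o)
    transition: (K, K) array, transition[i, j] = p(o_{t+1} = j | o_t = i)
                (assumed state-independent for this sketch)
    initial:    (K,) array, prior over the first option
    Returns a (T, K) array with the posterior over the option at each step,
    given the whole trajectory.
    """
    T, K = emission.shape
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    scale = np.zeros(T)

    # Forward pass, normalizing at each step to avoid numerical underflow.
    alpha[0] = initial * emission[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ transition) * emission[t]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    # Backward pass, reusing the forward scaling factors.
    for t in range(T - 2, -1, -1):
        beta[t] = transition @ (emission[t + 1] * beta[t + 1]) / scale[t + 1]

    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)
```

In an expectation-maximization loop, these posteriors would play the role of the E-step, and the parametric sub-policies would then be refit with the posteriors as sample weights.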
Unfortunately, the
Similar resources
Active Imitation Learning of Hierarchical Policies
In this paper, we study the problem of imitation learning of hierarchical policies from demonstrations. The main difficulty in learning hierarchical policies by imitation is that the high level intention structure of the policy, which is often critical for understanding the demonstration, is unobserved. We formulate this problem as active learning of Probabilistic State-Dependent Grammars (PSDG...
Combining Hierarchical Reinforcement Learning and Bayesian Networks for Natural Language Generation in Situated Dialogue
Language generators in situated domains face a number of content selection, utterance planning and surface realisation decisions, which can be strictly interdependent. We therefore propose to optimise these processes in a joint fashion using Hierarchical Reinforcement Learning. To this end, we induce a reward function for content selection and utterance planning from data using the PARADISE fra...
Algorithms for Batch Hierarchical Reinforcement Learning
Hierarchical Reinforcement Learning (HRL) exploits temporal abstraction to solve large Markov Decision Processes (MDP) and provide transferable subtask policies. In this paper, we introduce an off-policy HRL algorithm: Hierarchical Q-value Iteration (HQI). We show that it is possible to effectively learn recursive optimal policies for any valid hierarchical decomposition of the original MDP, gi...
Hierarchical Relative Entropy Policy Search
Many reinforcement learning (RL) tasks, especially in robotics, consist of multiple sub-tasks that are strongly structured. Such task structures can be exploited by incorporating hierarchical policies that consist of gating networks and sub-policies. However, this concept has only been partially explored for real world settings and complete methods, derived from first principles, are needed. Re...
The Position of the human rights components in the contents of Iran elementary education Textbooks
The present study addresses one of the most important contemporary issues in education and curriculum development, namely “human rights education”. Using content analysis, 36 textbooks from the 2012-2013 school year, totaling 3924 pages, were studied and analyzed. For the analysis of the data, Shannon's entropy method derived from the theory of systems was used to obtain the credibility rat...
Team Learning among Faculty Members: A Step toward Learning Universities
Introduction: Today, team learning is regarded as an effective way to respond to challenges such as human resource management and competition, and to achieve organizational effectiveness. The aim of this research is to study the team learning activity of health information technology faculty members at universities of medical sciences across the country. Methods: Faculties of Universities of Medical scien...